Introduction to Open Data Science - Course Project

Chapter 1

Thoughts about this course:

I enjoyed the soft landing on this course. It was comforting to start and go through the very basics.
From this course, I expect to learn to integrate the use of github and R markdown in my daily work. I also expect to learn the best practices in open data science. I heard about this course through
the email list of my doctoral program.

Reflections on the learning experience:

So far the book and exercises have been a great crash course in R tools and RStudio. I enjoy the style of writing in the book. My favorite section so far is chapter 4, which focuses on plots; I found it useful.

My GitHub repository: https://github.com/kattilakoski/IODS-project


Assignment 2: Analysis

The dataset used in this assignment is from a survey of learning approaches and achievements of students on an introductory statistics course. In this assignment we use a subset of the results including information on gender, age, attitude, total points and approaches (deep, surface and strategic) of 166 students.

Reading data in and graphical overview of the data

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(tidyverse)

#Read the data in:
learning2014 <- read_csv("./data/learning2014.csv")
#Explore dimensions:
dim(learning2014)
## [1] 166   7
#and structure
str(learning2014)
## spc_tbl_ [166 Ă— 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

In the first scatter plot matrix each variable is plotted against each of the others. The variable names are written on the diagonal of the plot. In the plots on the same row as a variable's text box, that variable is on the y-axis; in the plots in the same column, it is on the x-axis.

Based on this scatter plot matrix there are no strong, clearly visible correlations between the variables. However, some pairs show an upward or downward trend, such as attitude and points, and the deep and surface approaches.

Graphical overview of the data.

In the second plot matrix, created with ggpairs(), the variable names run along the top and left side of the plot. The upper right triangle shows the correlations between the continuous variables (age, attitude, deep, strategic, surface, points), the lower left triangle shows their scatter plots, and the diagonal shows their density plots. The leftmost column and top row show the box plots and histograms for the combinations of the categorical gender variable with the continuous variables.

From the matrix we can see that the majority of the respondents were female and around 20 years old. Age did not correlate significantly with any other variable. Attitude had a significant positive correlation with points and a significant negative correlation with the surface approach. The deep approach had a significant negative correlation with the surface approach, as did the strategic approach. This matrix confirms the trends that were visible in the previous scatter plot matrix.
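The correlations discussed above can also be checked numerically rather than only read off the plot; a minimal sketch, assuming `learning2014` is loaded as above and the tidyverse is attached:

```r
# correlation of attitude with exam points, and of deep vs. surface approach
cor(learning2014$attitude, learning2014$points)
cor(learning2014$deep, learning2014$surf)

# or all pairwise correlations of the numeric columns at once
cor(learning2014[-1]) %>% round(digits = 2)
```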

Regression model

# create a regression model with three explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# create a regression model without the variables that did not have statistically significant relationship with points
my_model3 <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Based on the t-tests of the linear regression coefficients, the only significant relationship is the positive one between attitude and points (p < 0.001). In this context the t-test checks whether a coefficient differs significantly from zero, i.e. whether the explanatory variable has a relationship with the target variable. As both the strategic and surface variables had p-values greater than 0.1, they did not have a statistically significant relationship with the points variable.

The multiple R-squared for the first regression model, with all three explanatory variables, is 0.2074. This means that only about a fifth of the variability in points is explained by these explanatory variables. The model with attitude alone has a slightly smaller R-squared of 0.1906, which is expected since removing predictors can only decrease R-squared; notably, attitude on its own explains nearly as much of the variability in points as all three variables together.
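The R-squared values can be extracted directly from the fitted model objects for comparison; a small sketch using the two models fitted above:

```r
# multiple R-squared of the two models (matches the printed summaries)
summary(my_model2)$r.squared   # ~0.207
summary(my_model3)$r.squared   # ~0.191

# adjusted R-squared penalizes extra predictors, so it is fairer for comparison
summary(my_model2)$adj.r.squared
summary(my_model3)$adj.r.squared
```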

Diagnostic plots

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5 for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(my_model2, c(1,2,5))

#same for the model with only attitude as the explanatory variable
par(mfrow = c(2,2))

plot(my_model3, c(1,2,5))

Diagnostic plots for the regression model with attitude as an explanatory variable.
Diagnostic plots for the regression model with attitude as an explanatory variable.

There is no distinctive pattern in the Residuals vs Fitted plots in either case, which suggests there is no non-linear relationship between the explanatory and target variables.

The normal Q-Q plots show fairly straight lines following the dotted reference line, with deviations at both ends. These deviations may indicate that the residuals do not follow a normal distribution perfectly.

In the Residuals vs Leverage plots, dashed lines would appear in the upper and lower right corners if some observations had high Cook's distance scores, meaning they are influential to the regression results and do not follow the trend of the rest of the cases. In these plots none of the cases appear to be very influential.
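Cook's distances can also be inspected numerically instead of only from the plot; a sketch using `my_model2`, with the common 4/n rule of thumb as an illustrative cutoff:

```r
# Cook's distance for every observation in the model
cd <- cooks.distance(my_model2)

# the largest values, and the observations above the 4/n rule of thumb
sort(cd, decreasing = TRUE)[1:5]
which(cd > 4 / nrow(learning2014))
```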


Assignment 3 analysis

Reading in and description of the data

#install packages:
# install.packages("boot")
# install.packages("readr")

# access libraries
library(dplyr); library(ggplot2); library(readr); library(tidyr); library(boot)

#Read the joined student alcohol consumption data into R
alc <- read_csv("./data/student_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print out the names of the variables in the data 
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data set includes data on student achievements in two Portuguese schools. It includes 370 observations and 35 variables. Variables include for example student grades, sex and age. The dataset was formed by combining two datasets, one regarding the student performance in mathematics and the other in Portuguese language.

Choosing variables for this analysis

For this analysis I chose to focus on age, mother's education, study time and absences. My hypotheses are that younger students drink less than older ones; that the higher the mother's education level, the lower the student's alcohol consumption; that students who study more drink less than students who study less; and that the more absences a student has, the more they drink.

Exploring the distributions of the variables and their relationship with alcohol use

#distribution of alcohol consumption and age
d1 <- ggplot(alc, aes(x = high_use, y = age)) + geom_boxplot() + ggtitle("Age and alcohol consumption")
d1

#distribution of alcohol consumption and mother's education
d2 <- ggplot(alc, aes(x = high_use, y = Medu)) + geom_boxplot() + ggtitle("Mother's education and student alcohol consumption")
d2

#distribution of alcohol consumption and study time
d3 <- ggplot(alc, aes(x = high_use, y = studytime)) + geom_boxplot() + ggtitle("Studytime and alcohol consumption")
d3

#distribution of alcohol consumption and absences
d4 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Absences and alcohol consumption")
d4

#Summary statistics
#age
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), mean(age))
## # A tibble: 2 Ă— 3
##   high_use count `mean(age)`
##   <lgl>    <int>       <dbl>
## 1 FALSE      259        16.5
## 2 TRUE       111        16.8
#mother's education
alc %>% group_by(high_use) %>% summarise(count = n(), mean(Medu))
## # A tibble: 2 Ă— 3
##   high_use count `mean(Medu)`
##   <lgl>    <int>        <dbl>
## 1 FALSE      259         2.80
## 2 TRUE       111         2.79
#study time
alc %>% group_by(high_use) %>% summarise(count = n(), mean(studytime))
## # A tibble: 2 Ă— 3
##   high_use count `mean(studytime)`
##   <lgl>    <int>             <dbl>
## 1 FALSE      259              2.16
## 2 TRUE       111              1.77
#absences
alc %>% group_by(high_use) %>% summarise(count = n(), mean(absences))
## # A tibble: 2 Ă— 3
##   high_use count `mean(absences)`
##   <lgl>    <int>            <dbl>
## 1 FALSE      259             3.71
## 2 TRUE       111             6.38

Based on the distribution plots it looks like my first hypothesis is correct and younger students drink less than older students, although in the numerical summary the difference in mean age is only 0.27 years, which is small. Against my hypothesis, there appears to be no difference in the alcohol use of students whose mothers have different education levels; this is also visible in the summary, as the difference between the mean education levels of the mothers is only 0.01.

Based on the distributions my hypothesis on study time also looks right, as the students who drink less seem to study more. This is backed up by the summary: there is a difference of 0.40 (on the study-time scale) between the mean study times of students with high and low alcohol use. My hypothesis about absences seems right as well, since students who drink more have more absences. This difference is the clearest: students with high alcohol use have on average 2.67 more absences than students with low alcohol use.
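The four separate group-wise summaries above could also be produced in a single call with `across()`; a minimal sketch, assuming `alc` and dplyr are loaded as above:

```r
# group means of all four chosen variables at once
alc %>%
  group_by(high_use) %>%
  summarise(count = n(),
            across(c(age, Medu, studytime, absences), mean))
```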

Logistic regression, odd ratios and coefficients

# find the model with glm()
m <- glm(high_use ~ age + Medu + studytime + absences, data = alc, family = "binomial")


# print out a summary of the fitted model
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + Medu + studytime + absences, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.798207   1.786727  -1.566 0.117323    
## age          0.164398   0.103533   1.588 0.112316    
## Medu        -0.003725   0.111021  -0.034 0.973232    
## studytime   -0.579521   0.158539  -3.655 0.000257 ***
## absences     0.075027   0.023247   3.227 0.001249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 417.24  on 365  degrees of freedom
## AIC: 427.24
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept)          age         Medu    studytime     absences 
## -2.798207282  0.164397602 -0.003725333 -0.579521121  0.075027298
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %   97.5 %
## (Intercept) 0.06091918 0.001766207 1.979230
## age         1.17868287 0.963461160 1.447356
## Medu        0.99628160 0.801834321 1.240253
## studytime   0.56016655 0.406319506 0.757511
## absences    1.07791358 1.032143878 1.130799

Based on the negative coefficient estimates in the summary of the fitted model, higher values of mother's education and study time are associated with a lower likelihood of the high_use variable being TRUE, and the opposite holds for the age and absences variables. However, only the coefficients for study time and absences are statistically significant. Therefore, based on the model summary, one could say that the more time a student spends studying, the lower the likelihood of high alcohol use, and the higher the number of absences, the higher the likelihood of high alcohol use. Based on these results my hypotheses about mother's education and age were wrong, and the hypotheses on study time and absences were right.

Based on the odds ratio for age, a one-year increase in age multiplies the odds of high alcohol use by 1.18 when the other variables are held constant, and the other odds ratios are interpreted the same way. Each additional level of mother's education multiplies the odds by 1.00, i.e. no change; each additional unit of study time multiplies the odds by 0.56; and each additional absence multiplies them by 1.08. Based on these point estimates the biggest changes in the odds come from age (the higher the age, the higher the odds of high alcohol use) and study time (the more study time, the lower the odds).

The 95% confidence intervals for the odds ratios of the intercept, age, Medu, studytime and absences are 0.00-1.98, 0.96-1.45, 0.80-1.24, 0.41-0.76 and 1.03-1.13 respectively. The interpretation is that we are 95% confident that the true odds ratio of the given variable lies within the interval. For example, we are 95% confident that the odds ratio for age is between 0.96 and 1.45; this interval is wide and includes 1, so we cannot be sure age has any effect at all. The same goes for mother's education. The intervals for study time and absences are narrower and exclude 1, so we have better knowledge of their effects on the target variable (alcohol use). Based on these results my hypotheses about mother's education and age were not supported, while the hypotheses on study time and absences were.
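The fitted model can also be used to compute a predicted probability for a concrete case; a sketch using the four-variable model `m` defined above (the example values are made up for illustration):

```r
# predicted probability of high alcohol use for a hypothetical student
# (age 17, mother's education level 3, study time 2, 10 absences)
new_student <- data.frame(age = 17, Medu = 3, studytime = 2, absences = 10)
predict(m, newdata = new_student, type = "response")
```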

Exploring the predictive power of the model

# find the model with glm() for only the statistically significant study time and absences
m <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, studytime, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 Ă— 5
##    studytime absences high_use probability prediction
##        <dbl>    <dbl> <lgl>          <dbl> <lgl>     
##  1         2        3 FALSE          0.266 FALSE     
##  2         1        0 FALSE          0.336 FALSE     
##  3         1        7 TRUE           0.468 FALSE     
##  4         3        1 FALSE          0.149 FALSE     
##  5         1        6 FALSE          0.448 FALSE     
##  6         3        2 FALSE          0.159 FALSE     
##  7         2        2 FALSE          0.251 FALSE     
##  8         2        3 FALSE          0.266 FALSE     
##  9         1        4 TRUE           0.410 FALSE     
## 10         1        2 TRUE           0.372 FALSE
#2x2 tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     93   18
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2810811

Based on the cross tabulation (confusion matrix), the model predicts the low-use cases fairly well: only 11 of the 259 students who reported low alcohol use were falsely predicted as high users. The model does not predict the high-use cases well, however: of the 111 high-use cases it got only 18 right, producing 93 false negatives.

The model has a training prediction error of 0.28, which means it makes a wrong prediction about 28% of the time. This is a rather high error, so I would conclude that the model does not predict very well and is therefore not very powerful.

I think the error rate of my guesses and that of the model were pretty similar. The model is slightly better for prediction than guessing, but it has much room for improvement, for example by including other variables.
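Since the `boot` package is already loaded, the model's performance could also be estimated with 10-fold cross-validation instead of the training error alone; a sketch assuming `m`, `alc` and `loss_func` are defined as above:

```r
library(boot)

# 10-fold cross-validation; cv.glm refits the model on each fold and
# evaluates the custom loss function on the held-out observations
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average prediction error across the folds (varies slightly run to run)
cv$delta[1]
```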


Assignment 4 analysis exercises

Reading in and description of the data

# Install required packages:
#install.packages(c("MASS", "corrplot"))
#install.packages("plotly")
# Access required packages
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(dplyr)
library(GGally)

# Load the Boston data from the MASS package
data("Boston")

# Explore the structure and the dimensions of the data 
glimpse(Boston) # 506 rows and 14 columns
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset used in this assignment, along with explanations of the variable names, can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html The dataset has 506 rows and 14 columns and includes information on housing values in the suburbs of Boston. The columns include, for example, the per capita crime rate by town (crim) and the average number of rooms per dwelling (rm).

# Show a graphical overview of the data 
pairs(Boston)

p <- ggpairs(Boston)
# draw the plot
p

# Show summaries of the variables in the data.
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Based on the plot matrix it seems that most variables have statistically significant correlations with each other; the exception is chas (the Charles River dummy variable, = 1 if the tract bounds the river and 0 otherwise), which shows no or only weaker correlations with the other variables. Chas only takes the values 0 and 1. From the density plots we can see, for example, that crim, zn, chas, dis, rad, tax and lstat are concentrated at low values (although some have a second small peak); only rm looks normally distributed; indus has two peaks of similar height; and age, ptratio and black are concentrated at higher values.

The same trends show in the corrplot, but the strength (intensity of color) and sign (blue for positive, red for negative) of the correlations are perhaps more easily visible there. The strongest negative correlations (when one variable gets higher, the other gets lower) seem to be between indus and dis, nox and dis, age and dis, and lstat and medv. The strongest positive correlation (both variables increase together) seems to be between rad and tax.
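The strongest correlations picked out by eye above can also be listed programmatically from the correlation matrix; a sketch assuming `Boston` and dplyr are loaded as above (note each pair appears twice, once in each direction):

```r
# turn the correlation matrix into a long table and sort by absolute value
cors <- as.data.frame(as.table(cor(Boston)))
cors %>%
  filter(Var1 != Var2) %>%
  arrange(desc(abs(Freq))) %>%
  head(10)
```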

# Standardize the dataset
boston_scaled <- scale(Boston)

# Print out summaries of the scaled data 
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime', using the quantiles as the break points in the categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

As a result of scaling, the variable values became smaller and the mean of each variable is now zero. This is because in scaling, the column mean is subtracted from each column and the difference is divided by the column's standard deviation.
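The equivalence between `scale()` and the manual mean/standard-deviation formula can be checked directly; a small sketch for one column (using `rm`, since `crim` was dropped from `boston_scaled` above):

```r
# manual standardization: subtract the column mean, divide by the column sd
manual <- (Boston$rm - mean(Boston$rm)) / sd(Boston$rm)

# compare with the column produced by scale()
all.equal(boston_scaled$rm, manual)
```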

# Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2425743 0.2475248 0.2425743 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.9691929 -0.9089157 -0.16296517 -0.8773034  0.43310935 -0.8994803
## med_low  -0.1265035 -0.2695380 -0.11163110 -0.5611315 -0.16607253 -0.3191996
## med_high -0.3929105  0.1623854  0.20012296  0.4317047  0.02060242  0.4137680
## high     -0.4872402  1.0171960 -0.07145661  1.1181640 -0.46565218  0.8504928
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8441420 -0.6979446 -0.7097905 -0.4452153  0.37788976 -0.77070038
## med_low   0.3209694 -0.5459225 -0.4470790 -0.1015357  0.31272351 -0.10549823
## med_high -0.4058625 -0.4156770 -0.3140979 -0.1873182  0.08204918  0.07367767
## high     -0.8724425  1.6373367  1.5134896  0.7798552 -0.71097670  0.88792293
##                 medv
## low       0.50655731
## med_low  -0.01444002
## med_high  0.06797723
## high     -0.71375065
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.13793450  0.7107207358 -1.00129849
## indus   -0.01680252 -0.1570751110  0.39172685
## chas    -0.08386464 -0.1174589551  0.02734767
## nox      0.42230698 -0.6782361108 -1.53877206
## rm      -0.09547367 -0.0708649040 -0.19359725
## age      0.30018977 -0.2628696590 -0.08087518
## dis     -0.08781986 -0.1303867265  0.17827567
## rad      3.17275936  0.9793809225  0.01328385
## tax     -0.02452146  0.0008232436  0.66291808
## ptratio  0.13012651 -0.0202560357 -0.49656307
## black   -0.12449650  0.0282325115  0.14195264
## lstat    0.18079071 -0.2599928093  0.35072171
## medv     0.13895975 -0.3429475294 -0.14905851
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9444 0.0398 0.0159
#Draw the LDA (bi)plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

# Save the crime categories from the test set 
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Then predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results with the crime categories from the test set.
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11       7        1    0
##   med_low    9      14        5    0
##   med_high   1       8       16    1
##   high       0       0        0   29
#Comment on the results. 

Based on the cross tabulation, the LDA model worked best at predicting the high crime rates: all 29 of them were classified correctly. Of the low crime rates the model predicted 11 out of 19 correctly, of the med_low rates 14 out of 28, and of the med_high rates 16 out of 26.
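The per-class counts above can be condensed into an overall accuracy figure, computed from the diagonal of the confusion table (a sketch using the `correct_classes` and `lda.pred` objects created above; the exact value varies with the random train/test split):

```r
# cross tabulation of true vs. predicted classes
conf <- table(correct = correct_classes, predicted = lda.pred$class)

# overall accuracy: correctly classified observations / all observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```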

# Reload the Boston dataset 
data("Boston")

# Standardize the dataset to get comparable distances
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Run k-means algorithm on the dataset.
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

#Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') # here it seems 2 is an optimal number of clusters
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

#p <- ggpairs(boston_scaled, col = km$cluster)
#p

The optimal number of clusters can be read from the qplot as the point where the total within-cluster sum of squares drops sharply. Here that happens around k = 2, so 2 was chosen as the number of clusters.
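The "sudden drop" can also be made slightly less visual by looking at the relative decrease in total WCSS from each k to k+1 (a rough heuristic, not a formal criterion; uses the `twcss` and `k_max` objects computed above):

```r
# absolute decrease in total WCSS at each step k -> k+1
drops <- -diff(twcss)

# relative decrease, as a fraction of the WCSS at k; the largest
# relative drop suggests where the elbow is
round(drops / twcss[-k_max], 2)
```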

# Standardize the dataset
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# add the cluster value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)

# Perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. 
# linear discriminant analysis
lda.fit <- lda(km.cluster ~ ., data = boston_scaled[,-4])   # lda() gave an error about variable 4 (chas) being constant within groups, so I removed it; I am not sure this is the best fix, but the model now runs without errors

# print the lda.fit object
lda.fit
## Call:
## lda(km.cluster ~ ., data = boston_scaled[, -4])
## 
## Prior probabilities of groups:
##          1          2          3 
## 0.06916996 0.61067194 0.32015810 
## 
## Group means:
##         crim         zn      indus        nox         rm        age        dis
## 1 -0.2048299 -0.1564737  0.2306535  0.3342374  0.3344149  0.3170678 -0.3634565
## 2 -0.3882449  0.2731699 -0.6264383 -0.5823006  0.2188304 -0.4585819  0.4807157
## 3  0.7847946 -0.4872402  1.1450405  1.0384727 -0.4896488  0.8062002 -0.8383961
##           rad        tax    ptratio      black      lstat       medv
## 1 -0.02700292 -0.1304164 -0.4453253  0.1787986 -0.1976385  0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183  0.3151747 -0.4640135  0.3182241
## 3  1.12436056  1.2034416  0.6329916 -0.6397959  0.9277624 -0.7457491
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.02479166 -0.13204141
## zn       0.42787622 -0.04638198
## indus    1.15646011  0.64348753
## nox      0.47943272  0.42987408
## rm       0.13637610 -0.12823804
## age     -0.06654278  0.30029385
## dis     -0.01915297  0.20848367
## rad      0.74637979  0.69845574
## tax      0.27967651 -1.02040695
## ptratio  0.19355485 -0.21964359
## black   -0.04753224  0.11581547
## lstat    0.47213016  0.02172623
## medv     0.06797263  0.92531426
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9822 0.0178
# Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}


# plot the lda results 
plot(lda.fit, dimen = 2, col = boston_scaled$km.cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2)

# Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

Based on the biplot, the variables medv, rad, indus and tax are the most influential linear separators for the clusters, which shows in the plot as longer arrows.

#Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  2
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. 
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
#kmeans for train set
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
set.seed(123)
km_train <- kmeans(train[,-14], centers = 3)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_train$cluster)
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
## Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf

The plots differ in how the points are coloured, but the spatial locations of the points are the same in both, since both plots show the same LDA projection; the k-means clusters and the crime classes colour the points differently.


Assignment 5 analysis

1. Graphical overview and summaries of the data

# Load libraries:

library(tibble)

# read in data:
human <- read_csv("./data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# country names to rownames

human <- column_to_rownames(human, "country")

#summaries of variables in data
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth          Parli      
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# graphical summary
library("GGally")
p <- ggpairs(human)

# calculate the correlation matrix and round it
cor_matrix <- cor(human) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##           Edu2.FM Labo.FM Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth Parli
## Edu2.FM      1.00    0.01     0.58    0.59  0.43   -0.66     -0.53  0.08
## Labo.FM      0.01    1.00    -0.14    0.05 -0.02    0.24      0.12  0.25
## Life.Exp     0.58   -0.14     1.00    0.79  0.63   -0.86     -0.73  0.17
## Edu.Exp      0.59    0.05     0.79    1.00  0.62   -0.74     -0.70  0.21
## GNI          0.43   -0.02     0.63    0.62  1.00   -0.50     -0.56  0.09
## Mat.Mor     -0.66    0.24    -0.86   -0.74 -0.50    1.00      0.76 -0.09
## Ado.Birth   -0.53    0.12    -0.73   -0.70 -0.56    0.76      1.00 -0.07
## Parli        0.08    0.25     0.17    0.21  0.09   -0.09     -0.07  1.00
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Graphical overview of the data.

In the plot matrix created with ggpairs() the variables are listed along the top and left side of the plot. The upper right half of the matrix shows the correlations between the variables, the lower left half shows scatter plots of the continuous variables, and the diagonal shows density plots of the continuous variables.

From the matrix we can see that many variables correlate statistically significantly with each other; there are both negative and positive correlations. From the density plots on the diagonal, as well as the summaries of the variables, we can see that none of the variables are normally distributed.

The correlation plot visualises the correlations between the variables more clearly: for example, life expectancy and expected years of schooling are strongly negatively correlated with maternal mortality ratio and adolescent birth rate.

2. Principal component analysis (PCA) and a biplot for non-standardised data

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables, with PC1 and PC2
#commented out but graph below as an image
#p <- biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
PCA biplot with non-standardised data on human development indices. Countries are shown in grey and the variables (Edu2.FM = ratio of female and male populations with secondary education, Labo.FM = ratio of labor force participation of females and males, Life.Exp= Life expectancy at birth, Edu.Exp = Expected years of schooling, GNI = Gross National Income per capita, Mat.Mor = Maternal Mortality Ratio, Ado.Birth = Adolescent Birth Rate, Parli = Percent Representation in Parliament) in pink.

3. & 4. PCA and biplot with standardised data

# standardise the variables:
human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth           Parli        
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
human_std <- as.data.frame(human_std)

# repeat PCA:
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables, with PC1 and PC2
#commented out but graph below as an image
#p <- biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
PCA biplot of standardised data on human development indices. Countries are shown in grey and the variables (Edu2.FM = ratio of female and male populations with secondary education, Labo.FM = ratio of labor force participation of females and males, Life.Exp= Life expectancy at birth, Edu.Exp = Expected years of schooling, GNI = Gross National Income per capita, Mat.Mor = Maternal Mortality Ratio, Ado.Birth = Adolescent Birth Rate, Parli = Percent Representation in Parliament) in pink.

Interpretations of PCA biplots

The PCA biplots created with standardised and non-standardised data clearly differ. This is because PCA is sensitive to the scales of the variables: without standardisation, the variables with the largest variances dominate the principal components. In the non-standardised biplot only the gross national income per capita seems to influence the first principal component (and it has no influence on the second), while the other variables appear to have little influence on either component. There are no clearly separate clusters, but two groups can be seen: one vertical and one diagonal.
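The dominance of GNI in the non-standardised PCA can be checked directly from the raw column variances, since prcomp() on unscaled data is driven by the variables with the largest variance (a quick check on the `human` data frame used above):

```r
# variance of each variable on its original scale; GNI, measured in
# dollars per capita, is orders of magnitude larger than the rest
sort(apply(human, 2, var), decreasing = TRUE)
```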

Interpretations for exercise 4 are included here as well:

In the PCA biplot with standardised data it is visible that more variables influence the principal components (PCs). Most of the variables seem to influence the first PC; only Percent Representation in Parliament (Parli) and the ratio of labor force participation of females and males (Labo.FM) don't seem to influence PC1 much, and instead influence the second PC. Maternal Mortality Ratio (Mat.Mor) and Adolescent Birth Rate seem to correlate positively, as do Life expectancy at birth (Life.Exp), the ratio of female and male populations with secondary education, Gross National Income per capita, and Expected years of schooling. Variables in these two groups also correlate negatively with the variables of the other group (e.g. Mat.Mor correlates negatively with Life.Exp). Parli and Labo.FM are positively correlated with each other but do not correlate with the other variables.
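To attach numbers to this interpretation, the share of total variance captured by the first two components can be read from the PCA summary (a sketch using the standardised `pca_human` object fitted above):

```r
# proportion of variance explained by PC1 and PC2
s <- summary(pca_human)
round(s$importance["Proportion of Variance", 1:2], 3)
```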

5. Multiple correspondence analysis

About the data: “The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).”

# Load libraries:
library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
#library(factoextra)
# Load tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Look at structure and dimensions of the data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) # 300 obs. 36 var.
## [1] 300  36
View(tea)

# From now on I will focus on the "Tea", "How", "how", "sugar", "where" and "lunch" columns:
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset

pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
# these are commented out to avoid duplicate graphs as I attached them as images
#p1 <- plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
#p2 <- plot(mca, graph.type = "classic", habillage = "quali")

MCA plot with only variables. MCA plot with both individuals and variables.

In the MCA plot it is visible that the categories other, unpackaged and tea shop are situated close to each other, which means they are similar to each other in relation to all the variables: it could be that people who buy tea from a tea shop also buy it unpackaged. These categories are far from the origin, and they affect both dimensions. The categories other, chain store+tea shop and tea bag+unpackaged are also situated close to each other, meaning they are similar in relation to the other variables, and they mainly influence dimension 2. Green is situated apart from all the other categories, meaning it could differ from them in some way; it mostly influences dimension 1. The map with both variables and individuals shows how the individuals are situated in the space in relation to each other, the categories, and the dimensions.
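The variance explained by the first MCA dimensions, visible in the eigenvalue table of the summary above, can also be pulled directly from the fitted `mca` object (FactoMineR stores it in the `eig` component):

```r
# eigenvalue, percentage of variance, and cumulative percentage
# for the first three MCA dimensions
head(mca$eig, 3)
```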


Assignment 6

1. Meet and Repeat: PART I with RATS data

1.1. Reading data in

# Read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

# Look at the (column) names of RATS
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"
# Look at the structure of RATS
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
# Print out summaries of the variables
summary(RATS)
##        ID            Group           WD1             WD8             WD15      
##  Min.   : 1.00   Min.   :1.00   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  1st Qu.: 4.75   1st Qu.:1.00   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  Median : 8.50   Median :1.50   Median :340.0   Median :345.0   Median :347.5  
##  Mean   : 8.50   Mean   :1.75   Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  Max.   :16.00   Max.   :3.00   Max.   :555.0   Max.   :560.0   Max.   :565.0  
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0

1.2 Graphical displays of longitudinal data: The magical pivot_longer()

# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)

# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

# Convert to long form
RATSL <- pivot_longer(RATS, cols = -c(ID, Group), 
                      names_to = "WD",
                      values_to = "Weight") %>% 
         mutate(Time = as.integer(substr(WD,3,4))) %>%
         arrange(Time)

# Take a glimpse at the RATSL data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
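As a quick sanity check on the reshaping, the long form should contain one row per rat per weighing day; a minimal base-R check using the counts visible in the data above:

```r
# 16 rats, each weighed on 11 days (WD1 ... WD64)
n_rats <- 16
n_days <- 11
n_rats * n_days  # 176, matching the row count reported by glimpse()
```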

1.3 Individuals on the plot

#Access the package ggplot2
library(ggplot2)

# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none")

From this plot we can see that groups 2 and 3 clearly differ from group 1 already in their starting weights. Group 1 also has more individuals in it. In group 2 there is an individual that seems to be an outlier, with a much higher weight than the others throughout the study.

1.4 The Golden Standardise

library(dplyr)
library(tidyr)
# Standardise the weights within each time point
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdrats = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD      <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight  <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0…
# Plot again with the standardised weights
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = stdrats, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

In the plot with standardised values the lines are straighter, as each weight has been standardised relative to the weights of all rats at the same time point. Whereas in the previous, non-standardised plot the weights showed an increasing trend, here some of the standardised profiles even show a decreasing trend.
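The standardisation step itself is simple; a minimal sketch with a made-up weight vector shows the formula used in the mutate() above:

```r
# Hypothetical weights at one time point
w <- c(240, 225, 245, 260)

# Centre by the mean, scale by the standard deviation
std <- (w - mean(w)) / sd(w)

# Standardised values always have mean 0 and sd 1
round(mean(std), 10)  # 0
round(sd(std), 10)    # 1
```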

1.5 Summary graphs

library(dplyr)
library(tidyr)
# Summary data with the mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()), .groups = "drop" )

# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> …
## $ Time  <int> …
## $ mean  <dbl> …
## $ se    <dbl> …
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

In the mean profile plot it is again visible that group 1 differs from groups 2 and 3, with clearly lower mean weights.
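The error bars in the plot above are based on the standard error of the mean, sd/sqrt(n); a small base-R sketch with made-up weights:

```r
# Hypothetical weights of one group at one time point
x <- c(250, 230, 250, 255)

# Standard error of the mean: sample sd divided by sqrt of sample size
se <- sd(x) / sqrt(length(x))
se  # about 5.54
```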

1.6 Find the outlier

library(dplyr)
library(tidyr)
# Create summary data by Group and ID, with the mean weight over all
# measurement days as the summary variable
RATSL8S <- RATSL %>%
  # filter(Time > 1) %>%   # uncomment to drop the day-1 baseline
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean weight by Group
library(ggplot2)
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 1-64")

# Create new data by filtering out the outlier and draw the plot again with the new data
RATSL8S1 <- filter(RATSL8S, mean <= 550)

# Glimpse the data
glimpse(RATSL8S1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 1-64")

In the first boxplot of the group means, the outlier in group 2 is clearly visible. Groups 1 and 3 also appear to have outliers, although less extreme ones than group 2. After removing the outlier of group 2, the box describing the spread of that group became much smaller.
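The outlier here was removed with a hand-picked cutoff (mean <= 550); a common alternative is the 1.5 x IQR rule that boxplots use for flagging points, sketched below on made-up group means:

```r
# Hypothetical group means with one extreme value
x <- c(261, 238, 260, 267, 269, 275, 594)
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # 238 and 594
```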

1.7 ANOVA

library(dplyr)
library(tidyr)
# Add the baseline (day 1 weight) from the original wide-form data as a new
# variable; the rows align because both data sets are ordered by ID
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results, the baseline weights are strongly related to the mean weights over the following weeks (p < 0.001). After adjusting for baseline, the group differences are not statistically significant at the 5% level (p ≈ 0.076).
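The F statistics in the table can be reproduced from the mean squares: each F is the variable's mean square divided by the residual mean square (numbers copied from the ANOVA table above):

```r
ms_group <- 363        # Mean Sq for Group
ms_resid <- 1352 / 12  # residual Mean Sq = Sum Sq / Df

f_group <- ms_group / ms_resid
f_group  # about 3.22, as reported

# p-value from the F distribution with 2 and 12 degrees of freedom
pf(f_group, df1 = 2, df2 = 12, lower.tail = FALSE)  # about 0.076, as reported
```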

2. Meet and Repeat: PART II with BPRS data

2.1 Reading data in

BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)

# Look at the (column) names of BPRS
names(BPRS)
##  [1] "treatment" "subject"   "week0"     "week1"     "week2"     "week3"    
##  [7] "week4"     "week5"     "week6"     "week7"     "week8"
# Look at the structure of BPRS
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
# Print out summaries of the variables
summary(BPRS)
##    treatment      subject          week0           week1           week2     
##  Min.   :1.0   Min.   : 1.00   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  1st Qu.:1.0   1st Qu.: 5.75   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##  Median :1.5   Median :10.50   Median :46.00   Median :41.00   Median :38.0  
##  Mean   :1.5   Mean   :10.50   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##  3rd Qu.:2.0   3rd Qu.:15.25   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##  Max.   :2.0   Max.   :20.00   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##      week3           week4           week5           week6           week7     
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00   Min.   :18.0  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75   1st Qu.:23.0  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50   Median :30.0  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23   Mean   :32.2  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00   3rd Qu.:38.0  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00   Max.   :62.0  
##      week8      
##  Min.   :20.00  
##  1st Qu.:22.75  
##  Median :28.00  
##  Mean   :31.43  
##  3rd Qu.:35.25  
##  Max.   :75.00

2.2 Plots and Linear Mixed Effects Models

library(dplyr)
library(tidyr)
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# Convert to long form
BPRSL <-  pivot_longer(BPRS, cols = -c(treatment, subject),
                       names_to = "weeks", values_to = "bprs") %>%
          arrange(weeks) #order by weeks variable

# Extract the week number
BPRSL <-  BPRSL %>% 
            mutate(week = as.integer(substr(weeks,5,5)))

# Take a glimpse at the BPRSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Plot the data
library(ggplot2)

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line()

ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Based on the coefficient estimates of the fitted model, bprs decreases by about 2.27 points for each additional week, and this coefficient is statistically significant (p < 2e-16): the later the evaluation was done, the lower the expected BPRS score. The treatment effect is small and not significant (p = 0.661). At this point it is good to keep in mind that this model assumes independence of the repeated measures, which in this case is not very likely.
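To make the slope concrete, the fitted fixed effects imply roughly a 2.3-point drop in bprs per week; a small sketch using the coefficients copied from the summary above (for treatment 1, so the treatment2 term is zero):

```r
# Coefficients copied from the lm summary output
intercept  <- 46.4539
slope_week <- -2.2704

# Predicted bprs for a treatment-1 subject at a given week
predicted_bprs <- function(week) intercept + slope_week * week

predicted_bprs(0)  # 46.45 at baseline
predicted_bprs(8)  # about 28.29 by week 8
```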

2.3 More appropriate models: random intercept and random slope model

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
# create a random intercept and random slope model
library(lme4)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results show that the model with both a random intercept and a random slope (BPRS_ref1) fits significantly better than the model with only a random intercept (BPRS_ref), with a p-value of 0.02636.
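The chi-square statistic of the ANOVA comparison comes directly from the difference in model deviances, with degrees of freedom equal to the difference in parameter counts (values copied from the output above):

```r
dev_ref  <- 2738.7  # deviance of the random-intercept model (5 parameters)
dev_ref1 <- 2731.4  # deviance of the random intercept + slope model (7 parameters)

chisq <- dev_ref - dev_ref1  # about 7.3, as reported
pchisq(chisq, df = 7 - 5, lower.tail = FALSE)  # about 0.026, as reported
```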

2.4 Random Intercept and Random Slope Model with interaction

# create a random intercept and random slope model with the interaction
# (note: as written, the formula below groups the random effects by treatment
# rather than subject and omits the week * treatment interaction)
library(lme4)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | treatment), data = BPRSL, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | treatment)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2843.3   2870.5  -1414.7   2829.3      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8252 -0.7277 -0.2586  0.5697  4.0818 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev. Corr 
##  treatment (Intercept) 2.824e-02  0.16804      
##            week        1.765e-03  0.04201 -1.00
##  Residual              1.516e+02 12.31302      
## Number of obs: 360, groups:  treatment, 2
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.3664  33.996
## week         -2.2704     0.2531  -8.971
## treatment2    0.5722     1.2979   0.441
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.741       
## treatment2 -0.475  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | treatment)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## BPRS_ref2    7 2843.3 2870.5 -1414.7   2829.3                     
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 97.896  0
# draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

library(dplyr)
library(tidyr)
# Create a new column Fitted to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)

# draw the plot of BPRSL with the Fitted values of weight
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)") +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")

I can’t get this last plot to work correctly. The likely reason is the model specification: BPRS_ref2 uses treatment instead of subject as the grouping factor of the random effects and omits the week * treatment interaction, so the fitted values are nearly identical for all subjects; a more conventional specification would be bprs ~ week * treatment + (week | subject). In any case, based on the ANOVA test comparing BPRS_ref2 and BPRS_ref1, BPRS_ref2 is not an improvement (its AIC and deviance are clearly higher), so it can be discarded.